In-Class Exercise 5

1 Overview

In this In-Class Exercise, we will demonstrate the basic concepts and methods of logistic regression specially designed for geographical data. In particular, we will demonstrate the following:

  • explain the similarities and differences between Logistic Regression (LR) algorithm versus geographical weighted Logistic Regression (GWLR) algorithm.

  • calibrate predictive models by using appropriate Geographically Weighted Logistic Regression algorithm for geographical data.

2 The Data

In this exercise, we will analyse the data from Nigeria. There are 2 datasets used, as outlined in sections 2.1 and 2.2. We will have chosen Osun for this analysis as this state has a relatively high proportion of non-functional water points compared to the other states in Nigeria.

2.1 Aspatial Data

Data was downloaded from WPdx Global Data Repositories in a csv format. The WPdx+ data set was filtered for “nigeria” in the column clean_country_name before downloading. There is a total of 95,008 unique water point records.

2.2 Geospatial Data

Nigeria Level-2 Administrative Boundary (also known as Local Government Area, LGA) polygon features GIS data was downloaded from geoBoundaries.

3 Getting Started

The R packages needed for this exercise are as follows:

pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep, GWmodel, tmap, skimr, caret)

4 Importing the Analytical Data

In the following code chunk, we will import Osun.rds and Osun_wp_sf.rds that have been previously tidied by using read_rds(). In Osun.rds, we have kept the geographical boundary for Osun state to allow for better plotting later.

Osun <- read_rds("rds/Osun.rds")
Osun_wp_sf <- read_rds("rds/Osun_wp_sf.rds")

We check our independent variable i.e. status by running the following code chunk. We can see that it is a binary data - TRUE representing functional water points and FALSE representing non-functional water points. We can see that there are 55.5% functional water points and 44.5% non-functional water points.

Osun_wp_sf %>% 
    freq(input = 'status')
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

  status frequency percentage cumulative_perc
1   TRUE      2642       55.5            55.5
2  FALSE      2118       44.5           100.0
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(Osun)+
    tm_polygons(alpha=0.4)+
    tm_shape(Osun_wp_sf)+
    tm_dots(col="status",
            alpha=0.6)+
    tm_view(set.zoom.limits = c(9,12))

5 Exploratory Data Analysis

Regression models are very sensitive to excessive number of missing values (e.g. fields with more than 20-50% missing values, depending on the sample size). In this section, we will look at distribution of the variables. We will use skimr() which will allow the results to be displayed in a nice report.

    Osun_wp_sf %>% 
    skim()
Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
Data summary
Name Piped data
Number of rows 4760
Number of columns 75
_______________________
Column type frequency:
character 47
logical 5
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1.00 5 44 0 2 0
report_date 0 1.00 22 22 0 42 0
status_id 0 1.00 2 7 0 3 0
water_source_clean 0 1.00 8 22 0 3 0
water_source_category 0 1.00 4 6 0 2 0
water_tech_clean 24 0.99 9 23 0 3 0
water_tech_category 24 0.99 9 15 0 2 0
facility_type 0 1.00 8 8 0 1 0
clean_country_name 0 1.00 7 7 0 1 0
clean_adm1 0 1.00 3 5 0 5 0
clean_adm2 0 1.00 3 14 0 35 0
clean_adm3 4760 0.00 NA NA 0 0 0
clean_adm4 4760 0.00 NA NA 0 0 0
installer 4760 0.00 NA NA 0 0 0
management_clean 1573 0.67 5 37 0 7 0
status_clean 0 1.00 9 32 0 7 0
pay 0 1.00 2 39 0 7 0
fecal_coliform_presence 4760 0.00 NA NA 0 0 0
subjective_quality 0 1.00 18 20 0 4 0
activity_id 4757 0.00 36 36 0 3 0
scheme_id 4760 0.00 NA NA 0 0 0
wpdx_id 0 1.00 12 12 0 4760 0
notes 0 1.00 2 96 0 3502 0
orig_lnk 4757 0.00 84 84 0 1 0
photo_lnk 41 0.99 84 84 0 4719 0
country_id 0 1.00 2 2 0 1 0
data_lnk 0 1.00 79 96 0 2 0
water_point_history 0 1.00 142 834 0 4750 0
clean_country_id 0 1.00 3 3 0 1 0
country_name 0 1.00 7 7 0 1 0
water_source 0 1.00 8 30 0 4 0
water_tech 0 1.00 5 37 0 20 0
adm2 0 1.00 3 14 0 33 0
adm3 4760 0.00 NA NA 0 0 0
management 1573 0.67 5 47 0 7 0
adm1 0 1.00 4 5 0 4 0
New Georeferenced Column 0 1.00 16 35 0 4760 0
lat_lon_deg 0 1.00 13 32 0 4760 0
public_data_source 0 1.00 84 102 0 2 0
converted 0 1.00 53 53 0 1 0
created_timestamp 0 1.00 22 22 0 2 0
updated_timestamp 0 1.00 22 22 0 2 0
Geometry 0 1.00 33 37 0 4760 0
ADM2_EN 0 1.00 3 14 0 30 0
ADM2_PCODE 0 1.00 8 8 0 30 0
ADM1_EN 0 1.00 4 4 0 1 0
ADM1_PCODE 0 1.00 5 5 0 1 0

Variable type: logical

skim_variable n_missing complete_rate mean count
rehab_year 4760 0 NaN :
rehabilitator 4760 0 NaN :
is_urban 0 1 0.39 FAL: 2884, TRU: 1876
latest_record 0 1 1.00 TRU: 4760
status 0 1 0.56 TRU: 2642, FAL: 2118

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1.00 68550.48 10216.94 49601.00 66874.75 68244.50 69562.25 471319.00 ▇▁▁▁▁
lat_deg 0 1.00 7.68 0.22 7.06 7.51 7.71 7.88 8.06 ▁▂▇▇▇
lon_deg 0 1.00 4.54 0.21 4.08 4.36 4.56 4.71 5.06 ▃▆▇▇▂
install_year 1144 0.76 2008.63 6.04 1917.00 2006.00 2010.00 2013.00 2015.00 ▁▁▁▁▇
fecal_coliform_value 4760 0.00 NaN NA NA NA NA NA NA
distance_to_primary_road 0 1.00 5021.53 5648.34 0.01 719.36 2972.78 7314.73 26909.86 ▇▂▁▁▁
distance_to_secondary_road 0 1.00 3750.47 3938.63 0.15 460.90 2554.25 5791.94 19559.48 ▇▃▁▁▁
distance_to_tertiary_road 0 1.00 1259.28 1680.04 0.02 121.25 521.77 1834.42 10966.27 ▇▂▁▁▁
distance_to_city 0 1.00 16663.99 10960.82 53.05 7930.75 15030.41 24255.75 47934.34 ▇▇▆▃▁
distance_to_town 0 1.00 16726.59 12452.65 30.00 6876.92 12204.53 27739.46 44020.64 ▇▅▃▃▂
rehab_priority 2654 0.44 489.33 1658.81 0.00 7.00 91.50 376.25 29697.00 ▇▁▁▁▁
water_point_population 4 1.00 513.58 1458.92 0.00 14.00 119.00 433.25 29697.00 ▇▁▁▁▁
local_population_1km 4 1.00 2727.16 4189.46 0.00 176.00 1032.00 3717.00 36118.00 ▇▁▁▁▁
crucialness_score 798 0.83 0.26 0.28 0.00 0.07 0.15 0.35 1.00 ▇▃▁▁▁
pressure_score 798 0.83 1.46 4.16 0.00 0.12 0.41 1.24 93.69 ▇▁▁▁▁
usage_capacity 0 1.00 560.74 338.46 300.00 300.00 300.00 1000.00 1000.00 ▇▁▁▁▅
days_since_report 0 1.00 2692.69 41.92 1483.00 2688.00 2693.00 2700.00 4645.00 ▁▇▁▁▁
staleness_score 0 1.00 42.80 0.58 23.13 42.70 42.79 42.86 62.66 ▁▁▇▁▁
location_id 0 1.00 235865.49 6657.60 23741.00 230638.75 236199.50 240061.25 267454.00 ▁▁▁▁▇
cluster_size 0 1.00 1.05 0.25 1.00 1.00 1.00 1.00 4.00 ▇▁▁▁▁
lat_deg_original 4760 0.00 NaN NA NA NA NA NA NA
lon_deg_original 4760 0.00 NaN NA NA NA NA NA NA
count 0 1.00 1.00 0.00 1.00 1.00 1.00 1.00 1.00 ▁▁▇▁▁

We can also see the number of missing values in each field. For instance, install_year has 1144 missing values out of a total of 4760 records for Osun. In this case, we cannot use this field (24% missing values) although it is a useful variable since we know water points that are beyond 8-9 years are more likely to be non-functional.

On the other hand, we can see that fields local_population_1km and water_point_population both have 4 missing values, which is a small number of records, and hence we can still use these 2 fields.

In the following code chunk, we will select the independent variables that we will use for our regression model and to exclude records that have missing values. We will use as.factor() for usage_capacity as there are only specific values for the capacity of the water points, hence, we should not treat this field as a continuous variables and instead, make the values as factor.

Osun_wp_sf_clean <- Osun_wp_sf %>% 
    filter_at(vars(status,
                   distance_to_primary_road,
                   distance_to_secondary_road,
                   distance_to_tertiary_road,
                   distance_to_city,
                   distance_to_town,
                   water_point_population,
                   local_population_1km,
                   usage_capacity,
                   is_urban,
                   water_source_clean),
              all_vars(!is.na(.))) %>% 
    mutate(usage_capacity = as.factor(usage_capacity))

6 Correlation Analysis

In this section, we want to know if any of the numerical independent variables are correlated. We will first need to drop the geometry column in the spatial data Osun_wp_sf_clean so that the geometry field does not interfere with correlation analysis.

Osun_wp <- Osun_wp_sf_clean %>% 
    select(c(7,35:39,42:43,46:47,57)) %>% 
    st_set_geometry(NULL)

We can then perform correlation analysis only on the numerical variables.

cluster_vars.cor = cor(
    Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

We can see that there is no multicollinearity observed among the numerical variables (no coefficient greater than 0.8).

7 Building a Logistic Regression Model

In the following code check, we will use glm of R to calibrate a logistic regression model for the water point status.

model <- glm(status ~ distance_to_primary_road+
                       distance_to_secondary_road+
                       distance_to_tertiary_road+
                       distance_to_city+
                       distance_to_town+
                       is_urban+
                       usage_capacity+
                       water_source_clean+
                       water_point_population+
                       local_population_1km,
                   data = Osun_wp_sf_clean,
                   family = binomial(link = 'logit'))

In the results for model, we can see that fitted.values are all probability values which is our y-hat.

We will then use blr_regress() from blorr package to generate a report for the model results.

blr_regress(model)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4744           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3887        0.1124      3.4588       5e-04 
        distance_to_primary_road            1      0.0000        0.0000     -0.7153      0.4744 
       distance_to_secondary_road           1      0.0000        0.0000     -0.5530      0.5802 
       distance_to_tertiary_road            1      1e-04         0.0000      4.6708      0.0000 
            distance_to_city                1      0.0000        0.0000     -4.7574      0.0000 
            distance_to_town                1      0.0000        0.0000     -4.9170      0.0000 
              is_urbanTRUE                  1     -0.2971        0.0819     -3.6294       3e-04 
           usage_capacity1000               1     -0.6230        0.0697     -8.9366      0.0000 
water_source_cleanProtected Shallow Well    1      0.5040        0.0857      5.8783      0.0000 
   water_source_cleanProtected Spring       1      1.2882        0.4388      2.9359      0.0033 
         water_point_population             1      -5e-04        0.0000    -11.3686      0.0000 
          local_population_1km              1      3e-04         0.0000     19.2953      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7347          Somers' D        0.4693   
% Discordant          0.2653          Gamma            0.4693   
% Tied                0.0000          Tau-a            0.2318   
Pairs                5585188          c                0.7347   
---------------------------------------------------------------

We can see that distance_to_primary_road and distance_to_secondary_road has p-value greater than 0.05, we will subsequently exclude these 2 fields which are not statistically significant (i.e. p-value < 0.05). For categorical variables, a positive value indicates an average correlation and a negative value implies a below average correlation.

For continuous variables, a positive value implies a direct correlation and a negative value implies an inverse relation, while the magnitude of the coefficient represents the strength of the correlations. We will make this analysis for the continuous variables after we have confirmed that they are statistically significant.

In the code chunk below, blr_confusion_matrix() from blorr package to prepare a confusion matrix. We will use cutoff = 0.5, this means that if the fitted.values sf greater than 0.5, we will label the water point as functional, and if the fitted.values determined is less than 0.5, we will label the water point as non-functional. (The validity of the cutoff is measured using accuracy, sensitivity, and specificity).

blr_confusion_matrix(model, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1301  738
         1   813 1904

                Accuracy : 0.6739 
     No Information Rate : 0.4445 

                   Kappa : 0.3373 

McNemars's Test P-Value  : 0.0602 

             Sensitivity : 0.7207 
             Specificity : 0.6154 
          Pos Pred Value : 0.7008 
          Neg Pred Value : 0.6381 
              Prevalence : 0.5555 
          Detection Rate : 0.4003 
    Detection Prevalence : 0.5713 
       Balanced Accuracy : 0.6680 
               Precision : 0.7008 
                  Recall : 0.7207 

        'Positive' Class : 1

We can also see that our accruracy is approximately 67.4%. The sensitivity is higher than the specificity, indicating that our true positive is higher (correctly determined approximately 72% true positive) than the true negative (model correctly determines approximately 62% true negative).

8 Building Geographically Weighted Logistic Regression (gwLR) Models

8.1 Converting from sf to sp data frame

We will use select() from dplyr package to select the variables fo interest. We will convert the data to SpatialPointsDataFrame data type for compatibility with subsequent packages. We will need to use Osun_wp_sf_clean which excludes the 4 polygons with missing values as polygons with missing values will cause an error prompt.

Osun_wp_sp <- Osun_wp_sf_clean %>% 
    select(c(status,
             distance_to_primary_road,
             distance_to_secondary_road,
             distance_to_tertiary_road,
             distance_to_city,
             distance_to_town,
             water_point_population,
             local_population_1km,
             is_urban,
             usage_capacity,
             water_source_clean)) %>% 
    as_Spatial()
Osun_wp_sp
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 11
names       : status, distance_to_primary_road, distance_to_secondary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean 
min values  :      0,        0.014461356813335,          0.152195902540837,         0.017815121653488, 53.0461399623541, 30.0019777713073,                      0,                    0,        0,           1000,           Borehole 
max values  :      1,         26909.8616132094,           19559.4793799085,          10966.2705628969,  47934.343603562, 44020.6393368124,                  29697,                36118,        1,            300,   Protected Spring 

Since our geometry data is already in projected coordinate format, we can set the longlat as FALSE (the following result will match the SI unit of the projected coordinate system). We will set the argument adaptive to FALSE which indicates that we are interested to compute the fixed bandwidth. We will leave all variables including distance_to_primary_road and distance_to_secondary_road in the following code chunk.

8.2 Building Fixed Bandwidth GWR Model

We can plot a basic gwlr using the bandwidth obtained earlier.

8.2.1 Computing Fixed Bandwidth

bw.fixed <- bw.ggwr(status ~ distance_to_primary_road+
                           distance_to_secondary_road+
                           distance_to_tertiary_road+
                           distance_to_city+
                           distance_to_town+
                           is_urban+
                           usage_capacity+
                           water_source_clean+
                           water_point_population+
                           local_population_1km,
                   data = Osun_wp_sp,
                   family = "binomial",
                   approach = "AIC",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2889 
       1        -2836 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5684.357 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2875 
       1        -2818 
       2        -2810 
       3        -2808 
       4        -2808 
       5        -2808 
Fixed bandwidth: 59200.13 AICc value: 5646.785 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2847 
       1        -2781 
       2        -2768 
       3        -2765 
       4        -2765 
       5        -2765 
       6        -2765 
Fixed bandwidth: 36599.53 AICc value: 5575.148 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2798 
       1        -2719 
       2        -2698 
       3        -2693 
       4        -2693 
       5        -2693 
       6        -2693 
Fixed bandwidth: 22631.59 AICc value: 5466.883 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2720 
       1        -2622 
       2        -2590 
       3        -2581 
       4        -2580 
       5        -2580 
       6        -2580 
       7        -2580 
Fixed bandwidth: 13998.93 AICc value: 5324.578 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2601 
       1        -2476 
       2        -2431 
       3        -2419 
       4        -2417 
       5        -2417 
       6        -2417 
       7        -2417 
Fixed bandwidth: 8663.649 AICc value: 5163.61 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2436 
       1        -2268 
       2        -2194 
       3        -2167 
       4        -2161 
       5        -2161 
       6        -2161 
       7        -2161 
       8        -2161 
       9        -2161 
Fixed bandwidth: 5366.266 AICc value: 4990.587 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2157 
       1        -1922 
       2        -1802 
       3        -1739 
       4        -1713 
       5        -1713 
Fixed bandwidth: 3328.371 AICc value: 4798.288 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1751 
       1        -1421 
       2        -1238 
       3        -1133 
       4        -1084 
       5        -1084 
Fixed bandwidth: 2068.882 AICc value: 4837.017 
 Iteration    Log-Likelihood:(With bandwidth:  4106.777 )
=========================
       0        -2297 
       1        -2095 
       2        -1997 
       3        -1951 
       4        -1938 
       5        -1936 
       6        -1936 
       7        -1936 
       8        -1936 
Fixed bandwidth: 4106.777 AICc value: 4873.161 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2036 
       1        -1771 
       2        -1633 
       3        -1558 
       4        -1525 
       5        -1525 
Fixed bandwidth: 2847.289 AICc value: 4768.192 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -1941 
       1        -1655 
       2        -1503 
       3        -1417 
       4        -1378 
       5        -1378 
Fixed bandwidth: 2549.964 AICc value: 4762.212 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1874 
       1        -1573 
       2        -1410 
       3        -1316 
       4        -1274 
       5        -1274 
Fixed bandwidth: 2366.207 AICc value: 4773.081 
 Iteration    Log-Likelihood:(With bandwidth:  2663.532 )
=========================
       0        -1979 
       1        -1702 
       2        -1555 
       3        -1474 
       4        -1438 
       5        -1438 
Fixed bandwidth: 2663.532 AICc value: 4762.568 
 Iteration    Log-Likelihood:(With bandwidth:  2479.775 )
=========================
       0        -1917 
       1        -1625 
       2        -1468 
       3        -1380 
       4        -1339 
       5        -1339 
Fixed bandwidth: 2479.775 AICc value: 4764.294 
 Iteration    Log-Likelihood:(With bandwidth:  2593.343 )
=========================
       0        -1956 
       1        -1674 
       2        -1523 
       3        -1439 
       4        -1401 
       5        -1401 
Fixed bandwidth: 2593.343 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2620.153 )
=========================
       0        -1965 
       1        -1685 
       2        -1536 
       3        -1453 
       4        -1415 
       5        -1415 
Fixed bandwidth: 2620.153 AICc value: 4761.89 
 Iteration    Log-Likelihood:(With bandwidth:  2576.774 )
=========================
       0        -1950 
       1        -1667 
       2        -1515 
       3        -1431 
       4        -1393 
       5        -1393 
Fixed bandwidth: 2576.774 AICc value: 4761.889 
 Iteration    Log-Likelihood:(With bandwidth:  2603.584 )
=========================
       0        -1960 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2603.584 AICc value: 4761.813 
 Iteration    Log-Likelihood:(With bandwidth:  2609.913 )
=========================
       0        -1962 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1410 
       5        -1410 
Fixed bandwidth: 2609.913 AICc value: 4761.831 
 Iteration    Log-Likelihood:(With bandwidth:  2599.672 )
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
Fixed bandwidth: 2599.672 AICc value: 4761.809 
 Iteration    Log-Likelihood:(With bandwidth:  2597.255 )
=========================
       0        -1957 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1403 
       5        -1403 
Fixed bandwidth: 2597.255 AICc value: 4761.809 
bw.fixed
[1] 2599.672

We obtain a fixed bandwidth of 2599.672 m (projection in Nigeria is in metres), which is approximately 2.6 km.

8.2.2 Building fixed bandwidth model

gwlr.fixed <- ggwr.basic(status ~ distance_to_primary_road+
                           distance_to_secondary_road+
                           distance_to_tertiary_road+
                           distance_to_city+
                           distance_to_town+
                           is_urban+
                           usage_capacity+
                           water_source_clean+
                           water_point_population+
                           local_population_1km,
                   data = Osun_wp_sp,
                   bw = bw.fixed,
                   family = "binomial",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1958 
       1        -1676 
       2        -1526 
       3        -1443 
       4        -1405 
       5        -1405 
gwlr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-18 00:03:15 
   Call:
   ggwr.basic(formula = status ~ distance_to_primary_road + distance_to_secondary_road + 
    distance_to_tertiary_road + distance_to_city + distance_to_town + 
    is_urban + usage_capacity + water_source_clean + water_point_population + 
    local_population_1km, data = Osun_wp_sp, bw = bw.fixed, family = "binomial", 
    kernel = "gaussian", adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_primary_road distance_to_secondary_road distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-124.555    -1.755     1.072     1.742    34.333  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.887e-01  1.124e-01   3.459 0.000543
distance_to_primary_road                 -4.642e-06  6.490e-06  -0.715 0.474422
distance_to_secondary_road               -5.143e-06  9.299e-06  -0.553 0.580230
distance_to_tertiary_road                 9.683e-05  2.073e-05   4.671 3.00e-06
distance_to_city                         -1.686e-05  3.544e-06  -4.757 1.96e-06
distance_to_town                         -1.480e-05  3.009e-06  -4.917 8.79e-07
is_urbanTRUE                             -2.971e-01  8.185e-02  -3.629 0.000284
usage_capacity1000                       -6.230e-01  6.972e-02  -8.937  < 2e-16
water_source_cleanProtected Shallow Well  5.040e-01  8.574e-02   5.878 4.14e-09
water_source_cleanProtected Spring        1.288e+00  4.388e-01   2.936 0.003325
water_point_population                   -5.097e-04  4.484e-05 -11.369  < 2e-16
local_population_1km                      3.451e-04  1.788e-05  19.295  < 2e-16
                                            
Intercept                                ***
distance_to_primary_road                    
distance_to_secondary_road                  
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.0  on 4744  degrees of freedom
AIC: 5712

Number of Fisher Scoring iterations: 5


 AICc:  5712.099
 Pseudo R-square value:  0.1295351
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2599.672 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -8.7228e+02 -4.9955e+00  1.7600e+00
   distance_to_primary_road                 -1.9389e-02 -4.8031e-04  2.9618e-05
   distance_to_secondary_road               -1.5921e-02 -3.7551e-04  1.2317e-04
   distance_to_tertiary_road                -1.5618e-02 -4.2368e-04  7.6179e-05
   distance_to_city                         -1.8416e-02 -5.6217e-04 -1.2726e-04
   distance_to_town                         -2.2411e-02 -5.7283e-04 -1.5155e-04
   is_urbanTRUE                             -1.9790e+02 -4.2908e+00 -1.6864e+00
   usage_capacity1000                       -2.0772e+01 -9.7231e-01 -4.1592e-01
   water_source_cleanProtected.Shallow.Well -2.0789e+01 -4.5190e-01  5.3340e-01
   water_source_cleanProtected.Spring       -5.2235e+02 -5.5977e+00  2.5441e+00
   water_point_population                   -5.2208e-02 -2.2767e-03 -9.8875e-04
   local_population_1km                     -1.2698e-01  4.9952e-04  1.0638e-03
                                                3rd Qu.      Max.
   Intercept                                 1.2763e+01 1073.2154
   distance_to_primary_road                  4.8443e-04    0.0142
   distance_to_secondary_road                6.0692e-04    0.0258
   distance_to_tertiary_road                 6.6814e-04    0.0128
   distance_to_city                          2.3718e-04    0.0150
   distance_to_town                          1.9271e-04    0.0224
   is_urbanTRUE                              1.2841e+00  744.3097
   usage_capacity1000                        3.0322e-01    5.9281
   water_source_cleanProtected.Shallow.Well  1.7849e+00   67.6343
   water_source_cleanProtected.Spring        6.7663e+00  317.4123
   water_point_population                    5.0102e-04    0.1309
   local_population_1km                      1.8157e-03    0.0392
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2795.084 
   AIC : 4414.606 
   AICc : 4747.423 
   Pseudo R-square value:  0.5722559 

   ***********************************************************************
   Program stops at: 2022-12-18 00:04:11 

From the results, we can see that the Geographically Weighted Regression model has a lower AIC compared to the Generalised Linear Regression. We cannot use the AICc because the global model (Generalised Linear Regression, which does not have geographical information) does not have AICc. This tells us that Geographically Weighted Regression model has improved explainability.

8.3 Model Assessment

8.3.1 Converting SDF into sf data.frame

To assess the model, we will first convert the model into SFD object as data.frame using the following code chunk.

gwr.fixed <- as.data.frame(gwlr.fixed$SDF)

Next, we will label the yhat values (i.e. predicted probability) greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed <- gwr.fixed %>% 
    mutate(most = ifelse(
        gwr.fixed$yhat >= 0.5, T, F))

We will use confusionMatrix() from caret package to generate the confusion matrix. We define data argument to be the predicted probability and reference argument to be the actual label (i.e. ground truth).

gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data=gwr.fixed$most, reference = gwr.fixed$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1824  263
     TRUE    290 2379
                                          
               Accuracy : 0.8837          
                 95% CI : (0.8743, 0.8927)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7642          
                                          
 Mcnemar's Test P-Value : 0.2689          
                                          
            Sensitivity : 0.8628          
            Specificity : 0.9005          
         Pos Pred Value : 0.8740          
         Neg Pred Value : 0.8913          
             Prevalence : 0.4445          
         Detection Rate : 0.3835          
   Detection Prevalence : 0.4388          
      Balanced Accuracy : 0.8816          
                                          
       'Positive' Class : FALSE           
                                          

When we compare the overall accuracy is now improved to 88.37% (geographically weighted) compared to the 67.39% that we obtained initially in the non-geographically weighted. In addition, the sensitivity improved from 72% to 86%. Also specificity improved from 62% to 90%. This implies we should apply a local strategy (looking at surrounding neighbours) instead of a global strategy to understand the factors for water points being functional or non-functional.

(Note that for the global model, you can see the coefficients of each independent variable. But we do not see this for the local geographically weighted model because one model is built for each state, and hence there are over 4000+ of such coefficients for each variable).

Osun_wp_sf_selected <- Osun_wp_sf_clean %>% 
    select(c(ADM2_EN, ADM2_PCODE,
             ADM1_EN, ADM1_PCODE,
             status))
gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)

8.4 Visualising gwLR

The following code chunk below is used to create an interactive point symbol map to compare the actual status of the water points against the status of the water points predicted by gwLR.

tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun)+
    tm_polygons(alpha = 0.1)+
    tm_shape(gwr_sf.fixed)+
    tm_dots(col = "yhat",
            border.col = "gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
prob_T

In the following, we will visualise how the standard error of coefficient and t-value for the field distance_to_tertiary_road differs for the local models obtained for each LGA in Osun.

The standard error of the coefficient measures how precisely the model estimates the coefficient’s unknown value. The standard error of the coefficient is always positive.

The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error calculates a t-value.

tertiary_TV <- tm_shape(Osun)+
    tm_polygons(alpha=0.1)+
    tm_shape(gwr_sf.fixed)+
    tm_dots(col="distance_to_tertiary_road_TV",
            border.col="gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
tertiary_SE <- tm_shape(Osun)+
    tm_polygons(alpha=0.1)+
    tm_shape(gwr_sf.fixed)+
    tm_dots(col="distance_to_tertiary_road_SE",
            border.col="gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
tmap_arrange(tertiary_SE, tertiary_TV, asp=1, ncol=2, sync=TRUE)
Variable(s) "distance_to_tertiary_road_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

With reference to the plot on the left, we can see that the standard error of coefficient for distance_to_tertiary_road is generally low for all areas (as indicated by the yellow dots), with exception of dots in red colour, i.e. LGAs in Atakumosa East.

9 Building Logistic Regression and Geographically Weighted Logistic Regression Models using only Statistically Significant Independent Variables

In this section, we will only use independent variables that are statistically significant. Like before, we will build both logistic regression model and geographically weighted logistic regression model and then compare the 2 models.

9.1 Logistic regression model

In the following code check, we will use glm of R to calibrate a logistic regression model for the water point status. We will exclude fields that are not statistically significant, i.e. distance_to_primary_road and distance_to_secondary_road.

model_sig <- glm(status ~ distance_to_tertiary_road+
                       distance_to_city+
                       distance_to_town+
                       is_urban+
                       usage_capacity+
                       water_source_clean+
                       water_point_population+
                       local_population_1km,
                   data = Osun_wp_sf_clean,
                   family = binomial(link = 'logit'))
blr_regress(model_sig)
                             Model Overview                              
------------------------------------------------------------------------
Data Set    Resp Var    Obs.    Df. Model    Df. Residual    Convergence 
------------------------------------------------------------------------
  data       status     4756      4755           4746           TRUE     
------------------------------------------------------------------------

                    Response Summary                     
--------------------------------------------------------
Outcome        Frequency        Outcome        Frequency 
--------------------------------------------------------
   0             2114              1             2642    
--------------------------------------------------------

                                 Maximum Likelihood Estimates                                   
-----------------------------------------------------------------------------------------------
               Parameter                    DF    Estimate    Std. Error    z value     Pr(>|z|) 
-----------------------------------------------------------------------------------------------
              (Intercept)                   1      0.3540        0.1055      3.3541       8e-04 
       distance_to_tertiary_road            1      1e-04         0.0000      4.9096      0.0000 
            distance_to_city                1      0.0000        0.0000     -5.2022      0.0000 
            distance_to_town                1      0.0000        0.0000     -5.4660      0.0000 
              is_urbanTRUE                  1     -0.2667        0.0747     -3.5690       4e-04 
           usage_capacity1000               1     -0.6206        0.0697     -8.9081      0.0000 
water_source_cleanProtected Shallow Well    1      0.4947        0.0850      5.8228      0.0000 
   water_source_cleanProtected Spring       1      1.2790        0.4384      2.9174      0.0035 
         water_point_population             1      -5e-04        0.0000    -11.3902      0.0000 
          local_population_1km              1      3e-04         0.0000     19.4069      0.0000 
-----------------------------------------------------------------------------------------------

 Association of Predicted Probabilities and Observed Responses  
---------------------------------------------------------------
% Concordant          0.7349          Somers' D        0.4697   
% Discordant          0.2651          Gamma            0.4697   
% Tied                0.0000          Tau-a            0.2320   
Pairs                5585188          c                0.7349   
---------------------------------------------------------------

We can see that all independent variables used to build this model are statistically significant (i.e. p-value < 0.05).

blr_confusion_matrix(model_sig, cutoff = 0.5)
Confusion Matrix and Statistics 

          Reference
Prediction FALSE TRUE
         0  1300  743
         1   814 1899

                Accuracy : 0.6726 
     No Information Rate : 0.4445 

                   Kappa : 0.3348 

McNemars's Test P-Value  : 0.0761 

             Sensitivity : 0.7188 
             Specificity : 0.6149 
          Pos Pred Value : 0.7000 
          Neg Pred Value : 0.6363 
              Prevalence : 0.5555 
          Detection Rate : 0.3993 
    Detection Prevalence : 0.5704 
       Balanced Accuracy : 0.6669 
               Precision : 0.7000 
                  Recall : 0.7188 

        'Positive' Class : 1

We can also see that our accuracy is approximately 67%. The sensitivity is also higher than the specificity, indicating that our true positive is higher (model has correctly determined approximately 72% true positive) than the true negative (model correctly determines approximately 61% true negative).

9.2 Geographically weighted logistic regression model (with fixed bandwidth)

9.2.1 Converting from sf to sp data frame

We will first convert the data to a SpatialPointsDataFrame data type for compatibility with subsequent packages. In here, we have also excluded fields that are not statistically significant, i.e. distance_to_primary_road and distance_to_secondary_road.

Osun_wp_sp_sig <- Osun_wp_sf_clean %>% 
    select(c(status,
             distance_to_tertiary_road,
             distance_to_city,
             distance_to_town,
             is_urban,
             usage_capacity,
             water_source_clean,
             water_point_population,
             local_population_1km)) %>% 
    as_Spatial()
Osun_wp_sp_sig
class       : SpatialPointsDataFrame 
features    : 4756 
extent      : 182502.4, 290751, 340054.1, 450905.3  (xmin, xmax, ymin, ymax)
crs         : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs 
variables   : 9
names       : status, distance_to_tertiary_road, distance_to_city, distance_to_town, is_urban, usage_capacity, water_source_clean, water_point_population, local_population_1km 
min values  :      0,         0.017815121653488, 53.0461399623541, 30.0019777713073,        0,           1000,           Borehole,                      0,                    0 
max values  :      1,          10966.2705628969,  47934.343603562, 44020.6393368124,        1,            300,   Protected Spring,                  29697,                36118 

9.2.2 Computing fixed bandwidth

Here, we will compute the fixed bandwidth that we will use to build the geographically weighted logistic regression model.

bw.fixed.sig <- bw.ggwr(status ~ distance_to_tertiary_road+
                       distance_to_city+
                       distance_to_town+
                       is_urban+
                       usage_capacity+
                       water_source_clean+
                       water_point_population+
                       local_population_1km,
                   data = Osun_wp_sp_sig,
                   family = "binomial",
                   approach = "AIC",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
Take a cup of tea and have a break, it will take a few minutes.
          -----A kind suggestion from GWmodel development group
 Iteration    Log-Likelihood:(With bandwidth:  95768.67 )
=========================
       0        -2890 
       1        -2837 
       2        -2830 
       3        -2829 
       4        -2829 
       5        -2829 
Fixed bandwidth: 95768.67 AICc value: 5681.18 
 Iteration    Log-Likelihood:(With bandwidth:  59200.13 )
=========================
       0        -2878 
       1        -2820 
       2        -2812 
       3        -2810 
       4        -2810 
       5        -2810 
Fixed bandwidth: 59200.13 AICc value: 5645.901 
 Iteration    Log-Likelihood:(With bandwidth:  36599.53 )
=========================
       0        -2854 
       1        -2790 
       2        -2777 
       3        -2774 
       4        -2774 
       5        -2774 
       6        -2774 
Fixed bandwidth: 36599.53 AICc value: 5585.354 
 Iteration    Log-Likelihood:(With bandwidth:  22631.59 )
=========================
       0        -2810 
       1        -2732 
       2        -2711 
       3        -2707 
       4        -2707 
       5        -2707 
       6        -2707 
Fixed bandwidth: 22631.59 AICc value: 5481.877 
 Iteration    Log-Likelihood:(With bandwidth:  13998.93 )
=========================
       0        -2732 
       1        -2635 
       2        -2604 
       3        -2597 
       4        -2596 
       5        -2596 
       6        -2596 
Fixed bandwidth: 13998.93 AICc value: 5333.718 
 Iteration    Log-Likelihood:(With bandwidth:  8663.649 )
=========================
       0        -2624 
       1        -2502 
       2        -2459 
       3        -2447 
       4        -2446 
       5        -2446 
       6        -2446 
       7        -2446 
Fixed bandwidth: 8663.649 AICc value: 5178.493 
 Iteration    Log-Likelihood:(With bandwidth:  5366.266 )
=========================
       0        -2478 
       1        -2319 
       2        -2250 
       3        -2225 
       4        -2219 
       5        -2219 
       6        -2220 
       7        -2220 
       8        -2220 
       9        -2220 
Fixed bandwidth: 5366.266 AICc value: 5022.016 
 Iteration    Log-Likelihood:(With bandwidth:  3328.371 )
=========================
       0        -2222 
       1        -2002 
       2        -1894 
       3        -1838 
       4        -1818 
       5        -1814 
       6        -1814 
Fixed bandwidth: 3328.371 AICc value: 4827.587 
 Iteration    Log-Likelihood:(With bandwidth:  2068.882 )
=========================
       0        -1837 
       1        -1528 
       2        -1357 
       3        -1261 
       4        -1222 
       5        -1222 
Fixed bandwidth: 2068.882 AICc value: 4772.046 
 Iteration    Log-Likelihood:(With bandwidth:  1290.476 )
=========================
       0        -1403 
       1        -1016 
       2       -807.3 
       3       -680.2 
       4       -680.2 
Fixed bandwidth: 1290.476 AICc value: 5809.719 
 Iteration    Log-Likelihood:(With bandwidth:  2549.964 )
=========================
       0        -2019 
       1        -1753 
       2        -1614 
       3        -1538 
       4        -1506 
       5        -1506 
Fixed bandwidth: 2549.964 AICc value: 4764.056 
 Iteration    Log-Likelihood:(With bandwidth:  2847.289 )
=========================
       0        -2108 
       1        -1862 
       2        -1736 
       3        -1670 
       4        -1644 
       5        -1644 
Fixed bandwidth: 2847.289 AICc value: 4791.834 
 Iteration    Log-Likelihood:(With bandwidth:  2366.207 )
=========================
       0        -1955 
       1        -1675 
       2        -1525 
       3        -1441 
       4        -1407 
       5        -1407 
Fixed bandwidth: 2366.207 AICc value: 4755.524 
 Iteration    Log-Likelihood:(With bandwidth:  2252.639 )
=========================
       0        -1913 
       1        -1623 
       2        -1465 
       3        -1376 
       4        -1341 
       5        -1341 
Fixed bandwidth: 2252.639 AICc value: 4759.188 
 Iteration    Log-Likelihood:(With bandwidth:  2436.396 )
=========================
       0        -1980 
       1        -1706 
       2        -1560 
       3        -1479 
       4        -1446 
       5        -1446 
Fixed bandwidth: 2436.396 AICc value: 4756.675 
 Iteration    Log-Likelihood:(With bandwidth:  2322.828 )
=========================
       0        -1940 
       1        -1656 
       2        -1503 
       3        -1417 
       4        -1382 
       5        -1382 
Fixed bandwidth: 2322.828 AICc value: 4756.471 
 Iteration    Log-Likelihood:(With bandwidth:  2393.017 )
=========================
       0        -1965 
       1        -1687 
       2        -1539 
       3        -1456 
       4        -1422 
       5        -1422 
Fixed bandwidth: 2393.017 AICc value: 4755.57 
 Iteration    Log-Likelihood:(With bandwidth:  2349.638 )
=========================
       0        -1949 
       1        -1668 
       2        -1517 
       3        -1432 
       4        -1398 
       5        -1398 
Fixed bandwidth: 2349.638 AICc value: 4755.753 
 Iteration    Log-Likelihood:(With bandwidth:  2376.448 )
=========================
       0        -1959 
       1        -1680 
       2        -1530 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2376.448 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2382.777 )
=========================
       0        -1961 
       1        -1683 
       2        -1534 
       3        -1450 
       4        -1416 
       5        -1416 
Fixed bandwidth: 2382.777 AICc value: 4755.491 
 Iteration    Log-Likelihood:(With bandwidth:  2372.536 )
=========================
       0        -1958 
       1        -1678 
       2        -1528 
       3        -1445 
       4        -1411 
       5        -1411 
Fixed bandwidth: 2372.536 AICc value: 4755.488 
 Iteration    Log-Likelihood:(With bandwidth:  2378.865 )
=========================
       0        -1960 
       1        -1681 
       2        -1532 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2378.865 AICc value: 4755.481 
 Iteration    Log-Likelihood:(With bandwidth:  2374.954 )
=========================
       0        -1959 
       1        -1679 
       2        -1530 
       3        -1446 
       4        -1412 
       5        -1412 
Fixed bandwidth: 2374.954 AICc value: 4755.482 
 Iteration    Log-Likelihood:(With bandwidth:  2377.371 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.371 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.942 )
=========================
       0        -1960 
       1        -1680 
       2        -1531 
       3        -1448 
       4        -1414 
       5        -1414 
Fixed bandwidth: 2377.942 AICc value: 4755.48 
 Iteration    Log-Likelihood:(With bandwidth:  2377.018 )
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
Fixed bandwidth: 2377.018 AICc value: 4755.48 
bw.fixed.sig
[1] 2377.371

We obtain a fixed bandwidth of 2377.371 m (projection in Nigeria is in metres), which is approximately 2.4 km.

9.2.3 Building fixed bandwidth model

Next, we will build a geographically weighted logistic regression model using the fixed bandwidth determined in the previous section.

gwlr.fixed.sig <- ggwr.basic(status ~ distance_to_tertiary_road+
                       distance_to_city+
                       distance_to_town+
                       is_urban+
                       usage_capacity+
                       water_source_clean+
                       water_point_population+
                       local_population_1km,
                   data = Osun_wp_sp,
                   bw = bw.fixed.sig,
                   family = "binomial",
                   kernel = "gaussian",
                   adaptive = FALSE,
                   longlat = FALSE)
 Iteration    Log-Likelihood
=========================
       0        -1959 
       1        -1680 
       2        -1531 
       3        -1447 
       4        -1413 
       5        -1413 
gwlr.fixed.sig
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2022-12-18 00:24:15 
   Call:
   ggwr.basic(formula = status ~ distance_to_tertiary_road + distance_to_city + 
    distance_to_town + is_urban + usage_capacity + water_source_clean + 
    water_point_population + local_population_1km, data = Osun_wp_sp, 
    bw = bw.fixed.sig, family = "binomial", kernel = "gaussian", 
    adaptive = FALSE, longlat = FALSE)

   Dependent (y) variable:  status
   Independent variables:  distance_to_tertiary_road distance_to_city distance_to_town is_urban usage_capacity water_source_clean water_point_population local_population_1km
   Number of data points: 4756
   Used family: binomial
   ***********************************************************************
   *              Results of Generalized linear Regression               *
   ***********************************************************************

Call:
NULL

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-129.368    -1.750     1.074     1.742    34.126  

Coefficients:
                                           Estimate Std. Error z value Pr(>|z|)
Intercept                                 3.540e-01  1.055e-01   3.354 0.000796
distance_to_tertiary_road                 1.001e-04  2.040e-05   4.910 9.13e-07
distance_to_city                         -1.764e-05  3.391e-06  -5.202 1.97e-07
distance_to_town                         -1.544e-05  2.825e-06  -5.466 4.60e-08
is_urbanTRUE                             -2.667e-01  7.474e-02  -3.569 0.000358
usage_capacity1000                       -6.206e-01  6.966e-02  -8.908  < 2e-16
water_source_cleanProtected Shallow Well  4.947e-01  8.496e-02   5.823 5.79e-09
water_source_cleanProtected Spring        1.279e+00  4.384e-01   2.917 0.003530
water_point_population                   -5.098e-04  4.476e-05 -11.390  < 2e-16
local_population_1km                      3.452e-04  1.779e-05  19.407  < 2e-16
                                            
Intercept                                ***
distance_to_tertiary_road                ***
distance_to_city                         ***
distance_to_town                         ***
is_urbanTRUE                             ***
usage_capacity1000                       ***
water_source_cleanProtected Shallow Well ***
water_source_cleanProtected Spring       ** 
water_point_population                   ***
local_population_1km                     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6534.5  on 4755  degrees of freedom
Residual deviance: 5688.9  on 4746  degrees of freedom
AIC: 5708.9

Number of Fisher Scoring iterations: 5


 AICc:  5708.923
 Pseudo R-square value:  0.129406
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 2377.371 
   Regression points: the same locations as observations are used.
   Distance metric: A distance matrix is specified for this model calibration.

   ************Summary of Generalized GWR coefficient estimates:**********
                                                   Min.     1st Qu.      Median
   Intercept                                -3.7021e+02 -4.3797e+00  3.5590e+00
   distance_to_tertiary_road                -3.1622e-02 -4.5462e-04  9.1291e-05
   distance_to_city                         -5.4555e-02 -6.5623e-04 -1.3507e-04
   distance_to_town                         -8.6549e-03 -5.2754e-04 -1.6785e-04
   is_urbanTRUE                             -7.3554e+02 -3.4675e+00 -1.6596e+00
   usage_capacity1000                       -5.5889e+01 -1.0347e+00 -4.1960e-01
   water_source_cleanProtected.Shallow.Well -1.8842e+02 -4.7295e-01  6.2378e-01
   water_source_cleanProtected.Spring       -1.3630e+03 -5.3436e+00  2.7714e+00
   water_point_population                   -2.9696e-02 -2.2705e-03 -1.2277e-03
   local_population_1km                     -7.7730e-02  4.4281e-04  1.0548e-03
                                                3rd Qu.      Max.
   Intercept                                 1.3755e+01 2171.6373
   distance_to_tertiary_road                 6.3011e-04    0.0237
   distance_to_city                          1.5921e-04    0.0162
   distance_to_town                          2.4490e-04    0.0179
   is_urbanTRUE                              1.0554e+00  995.1840
   usage_capacity1000                        3.9113e-01    9.2449
   water_source_cleanProtected.Shallow.Well  1.9564e+00   66.8914
   water_source_cleanProtected.Spring        7.0805e+00  208.3749
   water_point_population                    4.5879e-04    0.0765
   local_population_1km                      1.8479e-03    0.0333
   ************************Diagnostic information*************************
   Number of data points: 4756 
   GW Deviance: 2815.659 
   AIC : 4418.776 
   AICc : 4744.213 
   Pseudo R-square value:  0.5691072 

   ***********************************************************************
   Program stops at: 2022-12-18 00:25:19 

From the results, we can see that the Geographically Weighted Regression model has a lower AIC (i.e. 4418.776) compared to the Generalised Linear Regression (AIC = 5708.9). This tells us that Geographically Weighted Regression model has improved explainability.

9.3 Model Assessment

We will first convert the model into SFD object as data.frame using the following code chunk.

gwr.fixed.sig <- as.data.frame(gwlr.fixed.sig$SDF)

Next, we will label the yhat values (i.e. predicted probability) greater or equal to 0.5 into 1 and else 0. The result of the logic comparison operation will be saved into a field called most.

gwr.fixed.sig <- gwr.fixed.sig %>% 
    mutate(most = ifelse(
        gwr.fixed.sig$yhat >= 0.5, T, F))

We will use confusionMatrix() from caret package to generate the confusion matrix.

gwr.fixed.sig$y <- as.factor(gwr.fixed.sig$y)
gwr.fixed.sig$most <- as.factor(gwr.fixed.sig$most)
CM <- confusionMatrix(data=gwr.fixed.sig$most, reference = gwr.fixed.sig$y)
CM
Confusion Matrix and Statistics

          Reference
Prediction FALSE TRUE
     FALSE  1833  268
     TRUE    281 2374
                                          
               Accuracy : 0.8846          
                 95% CI : (0.8751, 0.8935)
    No Information Rate : 0.5555          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.7661          
                                          
 Mcnemar's Test P-Value : 0.6085          
                                          
            Sensitivity : 0.8671          
            Specificity : 0.8986          
         Pos Pred Value : 0.8724          
         Neg Pred Value : 0.8942          
             Prevalence : 0.4445          
         Detection Rate : 0.3854          
   Detection Prevalence : 0.4418          
      Balanced Accuracy : 0.8828          
                                          
       'Positive' Class : FALSE           
                                          

When we compare the overall accuracy is now improved to 88.46% (geographically weighted) compared to the 67.26% that we obtained initially in the non-geographically weighted logistic regression model. In addition, we also noted that both sensitivity and specificity increased significantly from 71.88% to 86.71% and 61.49% to 89.86% respectively. This implies we should apply a local strategy (i.e. gwLR - looking at surrounding neighbours) instead of a global strategy to understand the factors for water points being functional or non-functional.

9.4 Visualising gwLR

The following code chunk below is used to create an interactive point symbol map.

gwr_sf.fixed.sig <- cbind(Osun_wp_sf_selected, gwr.fixed.sig)
tmap_mode("view")
tmap mode set to interactive viewing
prob_T <- tm_shape(Osun)+
    tm_polygons(alpha = 0.1)+
    tm_shape(gwr_sf.fixed.sig)+
    tm_dots(col = "yhat",
            border.col = "gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
prob_T

Likewise, we will visualise how the standard error of coefficient and t-value for the field distance_to_tertiary_road differs for the local models obtained for each state in Osun.

tertiary_TV <- tm_shape(Osun)+
    tm_polygons(alpha=0.1)+
    tm_shape(gwr_sf.fixed.sig)+
    tm_dots(col="distance_to_tertiary_road_TV",
            border.col="gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
tertiary_SE <- tm_shape(Osun)+
    tm_polygons(alpha=0.1)+
    tm_shape(gwr_sf.fixed.sig)+
    tm_dots(col="distance_to_tertiary_road_SE",
            border.col="gray60",
            border.lwd = 1)+
    tm_view(set.zoom.limits = c(8,14))
tmap_arrange(tertiary_SE, tertiary_TV, asp=1, ncol=2, sync=TRUE)
Variable(s) "distance_to_tertiary_road_TV" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

With reference to the plot for standard error of coefficient, we can see that standard error of coefficient is generally low for all states (as indicated by pale yellow dots). However, we can see that for several water points in Aiyadade, the local model has a high standard error of coefficient for distance_to_tertiary_road. We can also observe from Section 8.4 and the plot here, that the gwLR model generated in section 8 and section 9 gives different local performance.